Abstract — We consider a combinatorial generalization of the 
classical multi-armed bandit problem that is defined as follows. 
There is a given bipartite graph of M users and N > M 
resources. For each user-resource pair there is an associated 
state that evolves as an aperiodic irreducible finite-state Markov 
chain with unknown parameters, with transitions occurring each 
time the particular user i is allocated resource j. The user 
i receives a reward that depends on the corresponding state 
each time it is allocated the resource j. The system objective 
is to learn the best matching of users to resources so that 
the long-term sum of the rewards received by all users is 
maximized. This corresponds to minimizing regret, defined here 
as the gap between the expected total reward that can be 
obtained by the best-possible static matching and the expected 
total reward that can be achieved by a given algorithm. We 
present a polynomial-storage and polynomial-complexity-per-step 
matching-learning algorithm for this problem. We show that this 
algorithm can achieve a regret that is uniformly arbitrarily close 
to logarithmic in time and polynomial in the number of users and 
resources. This formulation is broadly applicable to scheduling 
and switching problems in networks and significantly extends 
prior results in the area. 

I. Introduction 

Multi-armed bandit problems provide a fundamental approach 
to learning under stochastic rewards, and find rich 
applications in a wide range of networking contexts, from 
Internet advertising [1] to medium access in cognitive radio 
networks [2]-[4]. In the simplest, classic non-Bayesian version 
of the problem, studied by Lai and Robbins [5], there are K 
independent arms, each generating stochastic rewards that are 
i.i.d. over time. The player is unaware of the parameters for 
each arm, and must use some policy to play the arms in such a 
way as to maximize the cumulative expected reward over the 
long term. The policy's performance is measured in terms of 
its "regret", defined as the gap between the expected reward 
that could be obtained by an omniscient user that knows the 
parameters for the stochastic rewards generated by each arm 
and the expected cumulative reward of that policy. It is of 
interest to characterize the growth of regret with respect to 
time as well as with respect to the number of arms/players. 
Intuitively, if the regret grows sublinearly over time, the 
time-averaged regret tends to zero. 

There is inherently a tradeoff between exploration and 
exploitation in the learning process in a multi-armed bandit 
problem: on the one hand all arms need to be sampled 
periodically by the policy used, to ensure that the "true" best 
arm is found; on the other hand, the policy should play the arm 
that is considered to be the best often enough to accumulate 
rewards at a good pace. 

In this paper, we formulate a novel combinatorial gener- 
alization of the multi-armed bandit problem that allows for 
Markovian rewards and propose an efficient policy for it. In 
particular, there is a given bipartite graph of M users and 
N > M resources. For each user-resource pair there 
is an associated state that evolves as an aperiodic irreducible 
finite-state Markov chain with unknown parameters, with 
transitions occurring each time the particular user i is allocated 
resource j. The user i receives a reward that depends on 
the corresponding state each time it is allocated the resource 
j. A key difference from the classic multi-armed bandit is 
that each user can potentially see a different reward process 
for the same resource. If we therefore view each possible 
matching of users to resources as an arm, then we have a 
super-exponential number of arms with dependent rewards. 
Thus, this new formulation is significantly more challenging 
than the traditional multi-armed bandit problems. 

Because our formulation allows for user-resource matching, 
it could be potentially applied to a diverse range of networking 
settings such as switching in routers (where inputs need to 
be matched to outputs) or frequency scheduling in wireless 
networks (where nodes need to be allocated to channels) or 
for server assignment problems (for allocating computational 
resources for various processes), etc., with the objective of 
learning as quickly as possible so as to maximize the usage 
of the best options. For instance, our formulation is general 
enough to be applied to the channel allocation problem in 
cognitive radio networks considered in [2] if the rewards for 
each user-channel pair come from a discrete set and are i.i.d. 
over time (which is a special case of Markovian rewards). 

Our main contribution in this work is the design of a 
novel policy for this problem that we refer to as Matching 
Learning for Markovian Rewards (MLMR). Since we treat 
each possible matching of users to resources as an arm, the 
number of arms in our formulation grows super-exponentially. 
However, MLMR uses only polynomial storage, and requires 
only polynomial computation at each step. We analyze the 
regret for this policy with respect to the best possible static 
matching, and show that it is uniformly logarithmic over time 
under some restrictions on the underlying Markov process. We 
also show that when these restrictions are removed, the regret 
can still be made arbitrarily close to logarithmic with respect 
to time. In either case, the regret is polynomial in the number 
of users and resources. 

The rest of the paper is organized as follows. In section II we 
present our work in the context of prior results on multi-armed 
bandits. In section III we present the problem formulation. In 
section IV we present a polynomial-storage 
polynomial-time-per-step learning policy, which we refer to as MLMR. We 
analyze the regret for this policy in section V and show that 
it yields a bound on the regret that is uniformly logarithmic 
over time and polynomial in the number of users and resources 
under certain conditions on the Markov chains describing 
the state evolution for the arms. We then show that the 
regret can still be arbitrarily close to logarithmic with respect 
to time when no such knowledge is available. We present some 
examples and simulations in section VI and conclude with 
some comments and ideas for future work in section VII. 

II. Prior Work 

The problem we consider in this paper differs from 
prior work in two key respects: we treat rewards that are 
(i) dependent across a super-exponential number of arms and 
(ii) generated by states that evolve in a non-i.i.d. Markovian 
fashion over time. We summarize below prior work, which has 
treated a) independent and temporally i.i.d. rewards, or 
b) independent and Markovian state-based rewards, or 
c) non-independent arms with temporally i.i.d. rewards. 

A. Independent arms with temporally i.i.d. rewards 

The work by Lai and Robbins [5] assumes K independent 
arms, each generating rewards that are i.i.d. over time from 
a given family of distributions with an unknown real-valued 
parameter. For this problem, they present a policy that provides 
an expected regret that is O(K log n), i.e. linear in the number 
of arms and asymptotically logarithmic in n. Anantharam et 
al. extend this work to the case when M simultaneous plays 
are allowed [6]. The work by Agrawal [7] presents 
easier-to-compute policies based on the sample mean that also have 
asymptotically logarithmic regret. The paper by Auer et al. [8] 
considers arms with non-negative rewards that are i.i.d. 
over time with an arbitrary un-parameterized distribution whose 
only restriction is that it has finite support. Further, 
they provide a simple policy (referred to as UCB1), which 
achieves logarithmic regret uniformly over time, rather than 
only asymptotically. Our work utilizes a general 
Chernoff-Hoeffding-bound-based approach to regret analysis pioneered 
by Auer et al. 

Some recent work has shown the design of distributed 
multiuser policies providing asymptotically logarithmic regret, 
for the context of cognitive radio networks [3], [4]. 



B. Independent arms with Markovian rewards 

There has been relatively less work on multi-armed bandits 
with Markovian rewards. Anantharam et al. [9] wrote one of 
the earliest papers with such a setting. They proposed a policy 
to pick m out of the N arms each time slot and proved lower 
and upper bounds on the regret. However, the rewards 
in their work are assumed to be generated by rested Markov 
chains with transition probability matrices defined by a single 
parameter $\theta$ with identical state spaces. Also, the result for the 
upper bound is achieved only asymptotically. 

For the case of single users and independent arms, a recent 
work by Tekin and Liu [10] has extended the results in 
[9] to the case with no requirement for a single parameter 
and identical state spaces across arms. They propose to use 
UCB1 from [8] for the multi-armed bandit problem with 
rested Markovian rewards and prove a logarithmic upper 
bound on the regret under some conditions on the Markov 
chain. We use elements of the proof from [10] in this work, 
which is however quite different in its combinatorial matching 
formulation (which allows for dependent arms). Work on 
restless Markovian rewards with single users and independent 
arms can be found in [11]-[13]. 

C. Dependent arms with temporally i.i.d. rewards 

The paper by Pandey et al. [1] divides arms into clusters 
of dependent arms, each providing binary rewards, but they 
do not present any theoretical analysis on the expected regret. 
In [14], the reward from each arm is modeled as the sum 
of a linear combination of a set of static random numbers 
and a zero-mean random variable that is i.i.d. over time and 
independent across arms. This is quite different from our 
model of rewards. 

Our work in this paper is closest to and builds on the recent 
work which introduced combinatorial multi-armed bandits [2]. 
The formulation in [2] has the restriction that the reward 
process must be i.i.d. over time. A polynomial storage matching 
learning algorithm is presented in [2] that yields regret that is 
polynomial in users and resources and uniformly logarithmic 
in time for the case of i.i.d. rewards. Although i.i.d. rewards are 
a special case of Markovian state-based rewards, one reason 
this work is not a strict generalization of [2] is our assumption 
that the number of possible states, and hence the support of the 
reward distribution on each arm, is finite (whereas [2] allows 
for continuous reward distributions with bounded support). 

III. Problem Formulation 

We consider a bipartite graph with M users and N > M 
resources predefined by some application. Time is slotted and 
is indexed by n. At each decision period (also referred to 
interchangeably as time slot), each of the M users is assigned 
a resource with some policy. 

For each user-resource pair (i, j), there is an associated state 
that evolves as an aperiodic irreducible finite-state Markov 
chain with unknown parameters. When user i is assigned 
resource j, assuming there are no other conflicting users 
assigned this resource, i is able to receive a reward that depends 
on the corresponding state each time it is allocated the resource 
j. We denote the state space as $S_{i,j} = \{z_1, z_2, \ldots, z_{|S_{i,j}|}\}$. 
The state of the Markov chain for each user-resource pair 
(i, j) evolves only when resource j is allocated to user i. We 
assume the Markov chains for different user-resource pairs are 
mutually independent. The reward obtained by user i while allocated 
resource j in state $z \in S_{i,j}$ is denoted as $\theta_z^{i,j}$, which is also 
unknown to the users. We denote $P_{i,j} = \{p_{i,j}(z_a, z_b)\}_{z_a, z_b \in S_{i,j}}$ 
as the transition probability matrix for the Markov chain 
associated with (i, j). Denote $\pi_z^{i,j}$ as the steady state distribution for state z. The 
mean reward obtained by user i on resource j is denoted as $\mu_{i,j}$. 
Then we have $\mu_{i,j} = \sum_{z \in S_{i,j}} \theta_z^{i,j} \pi_z^{i,j}$. The set of all mean 
rewards is denoted as $\mu = \{\mu_{i,j}\}$. 
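As an illustration of the mean-reward definition, the following minimal sketch computes the stationary distribution and the resulting mean reward for a two-state chain. The function names and the numeric values in the usage note are hypothetical and are not taken from the paper; the two-state closed form $\pi_0 = p_{10}/(p_{01}+p_{10})$ is the standard one and matches the formulas used in Section VI.

```python
def stationary_two_state(p01, p10):
    """Stationary distribution (pi_0, pi_1) of a two-state Markov chain.

    p01 is the probability of moving from state 0 to state 1,
    p10 the probability of moving from state 1 to state 0.
    Both are assumed to lie strictly between 0 and 1.
    """
    total = p01 + p10
    return (p10 / total, p01 / total)


def mean_reward(p01, p10, theta0, theta1):
    """Mean reward: sum over states of (reward in state) * (stationary prob.)."""
    pi0, pi1 = stationary_two_state(p01, p10)
    return theta0 * pi0 + theta1 * pi1
```

For instance, with the hypothetical values p01 = 0.3, p10 = 0.6 and rewards 0 and 1 in states 0 and 1, the chain spends one third of its time in state 1, so `mean_reward(0.3, 0.6, 0.0, 1.0)` is approximately 1/3.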

We denote $Y_{i,j}(n)$ as the actual reward obtained by a user i 
if it is assigned resource j at time n. We assume that 
$Y_{i,j}(n) = \theta_{z(n)}^{i,j}$ if user i is the only occupant of resource j at time n, 
where z(n) is the state of the Markov chain associated with 
(i, j) at time n. Else, if multiple users are allocated resource j, 
then we assume that, due to interference, at most one of the 
conflicting users i' gets reward $Y_{i',j}(n) = \theta_{z'(n)}^{i',j}$, where z'(n) 
is the state of the Markov chain associated with (i', j) at time n, 
while the other users $i \ne i'$ on the resource j get zero reward, 
i.e., $Y_{i,j}(n) = 0$. This interference model covers scenarios in 
many networking settings. 

A deterministic policy $\alpha(n)$ at each time is defined as a map 
from the observation history $\{O_t\}_{t=1}^{n-1}$ to a vector of resources 
o(n) to be selected at period n, where $O_t$ is the observation at 
time t; the i-th element in o(n), $o_i(n)$, represents the resource 
allocation for user i. Then the observation history $\{O_t\}_{t=1}^{n}$ 
in turn can be expressed as $\{o_i(t), Y_{i,o_i(t)}(t)\}_{1 \le i \le M, 1 \le t \le n}$. 

Due to the fact that allocating more than one user to a re- 
source is always worse than assigning each a different resource 
in terms of sum-throughput, we will focus on collision-free 
policies that assign all users distinct resources, which we will 
refer to as a permutation or matching. There are P(N, M) 
such permutations. 
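To give a feel for how quickly the number of collision-free matchings P(N, M) = N!/(N - M)! grows, here is a small sketch using only the Python standard library. The function names are illustrative choices, and resources and users are indexed from 0 rather than from 1 as in the text.

```python
import math
from itertools import permutations


def count_matchings(num_users, num_resources):
    """Number of collision-free assignments P(N, M) = N! / (N - M)!."""
    return math.perm(num_resources, num_users)


def enumerate_matchings(num_users, num_resources):
    """List every matching as a tuple whose i-th entry is the resource
    (0-indexed) assigned to user i. Only practical for small M and N."""
    return list(permutations(range(num_resources), num_users))
```

With M = 2 users and N = 4 resources there are only 12 matchings, but already for M = 10 and N = 20 the count exceeds 6 x 10^11, which is why storing one statistic per matching quickly becomes impractical.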

We formulate our problem as a combinatorial multi-armed 
bandit, in which each arm corresponds to a matching of the 
users to resources. We can represent the arm corresponding to 
a permutation k ($1 \le k \le P(N, M)$) as the index set 
$A_k = \{(i,j) : (i,j) \text{ is in permutation } k\}$. The stochastic reward for 
choosing arm k at time n under policy $\alpha$ is then given as 

$$Y_{\alpha(n)}(n) = \sum_{(i,j) \in A_{\alpha(n)}} Y_{i,j}(n) = \sum_{(i,j) \in A_{\alpha(n)}} \theta_{z(n)}^{i,j}.$$

Note that different from most prior work on multi-armed 
bandits, this combinatorial formulation results in dependence 
across arms that share common components. 

A key metric of interest in evaluating a given policy for this 
problem is regret, which is defined as the difference between 
the expected reward that could be obtained by the best-possible 
static matching, and that obtained by the given policy. It can 
be expressed as: 

$$R^{\alpha}(n) = n\mu^* - E^{\alpha}\Big[\sum_{t=1}^{n} Y_{\alpha(t)}(t)\Big] = n\mu^* - E^{\alpha}\Big[\sum_{t=1}^{n} \sum_{(i,j) \in A_{\alpha(t)}} \theta_{z(t)}^{i,j}\Big], \tag{1}$$

where $\mu^* = \max_k \sum_{(i,j) \in A_k} \mu_{i,j}$, the expected reward of the 
optimal arm, is the expected sum-weight of the maximum 
weight matching of users to resources with $\mu_{i,j}$ as the weight. 

We are interested in designing policies for this combinatorial 
multi-armed bandit problem with Markovian rewards that 
perform well with respect to regret. Intuitively, we would like 
the regret $R^{\alpha}(n)$ to be as small as possible. If it is sub-linear 
with respect to time n, the time-averaged regret will tend to 
zero. 

IV. Matching Learning for Markovian Rewards 

A straightforward idea for the combinatorial multi-armed 
bandit problem with Markovian rewards is to treat each matching 
as an arm, apply the UCB1 policy (given by Auer et al. [8]) 
directly, and ignore the dependencies across the different arms. 
For each arm k, two variables are stored and updated: the time 
average of all the observation values of arm k and the number 
of times that arm k has been played up to the current time slot. 
The UCB1 policy makes decisions based on this information 
alone. 

However, there are several problems that arise in applying 
UCB1 directly in the above setting. We note that UCB1 
requires both the storage and computation time that are linear 
in the number of arms. Since the number of arms in this 
formulation grows as P(N, M), it is highly unsatisfactory. 
Also, the upper bound on regret given in [10] no longer applies, 
since the rewards across arms are no longer independent 
and the states of an arm may evolve even when this 
arm is not played. To the best of our knowledge, no existing 
analytical result on the upper bound of regret can be applied 
directly in this setting. 

So we are motivated to propose a policy which more efficiently 
stores observations from correlated arms and exploits 
the correlations to make better decisions. Our key idea is to 
use two M by N matrices, $(\hat\theta_{i,j})_{M \times N}$ and $(n_{i,j})_{M \times N}$, to 
store the information for each user-resource pair, rather than 
for each arm as a whole. $\hat\theta_{i,j}$ is the average (sample mean) 
of all the observed values of resource j by user i up to the 
current time slot (obtained through potentially different sets 
of arms over time). $n_{i,j}$ is the number of times that resource 
j has been assigned to user i up to the current time slot. 

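At each time slot n, after an arm k is played, we get the 
observation of $Y_{i,j}(n)$ for all $(i,j) \in A_k$. Then $(\hat\theta_{i,j})_{M \times N}$ 
and $(n_{i,j})_{M \times N}$ (both initialized to 0 at time 0) are updated 
as follows: 

$$\hat\theta_{i,j}(n) = \begin{cases} \dfrac{\hat\theta_{i,j}(n-1)\, n_{i,j}(n-1) + Y_{i,j}(n)}{n_{i,j}(n-1) + 1}, & \text{if } (i,j) \in A_k \\ \hat\theta_{i,j}(n-1), & \text{else} \end{cases} \tag{2}$$

$$n_{i,j}(n) = \begin{cases} n_{i,j}(n-1) + 1, & \text{if } (i,j) \in A_k \\ n_{i,j}(n-1), & \text{else} \end{cases} \tag{3}$$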

Note that while we indicate the time index in the above 
updates for notational clarity, it is not necessary to store the 
matrices from previous time steps while running the algorithm. 
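The in-place nature of updates (2) and (3) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name, the list-of-lists representation of the two matrices, and the dictionary of observed rewards are assumptions made for the example, and indices are 0-based.

```python
def update_statistics(theta_hat, n_counts, arm, rewards):
    """Apply updates (2) and (3) in place after playing one arm.

    theta_hat -- M x N list of lists of sample means
    n_counts  -- M x N list of lists of play counts
    arm       -- iterable of (i, j) pairs in the played matching A_k
    rewards   -- dict mapping each played (i, j) to its observed reward Y_ij(n)
    Entries for pairs outside the played arm are left untouched.
    """
    for (i, j) in arm:
        n_old = n_counts[i][j]
        # Running mean: fold the new observation into the old average.
        theta_hat[i][j] = (theta_hat[i][j] * n_old + rewards[(i, j)]) / (n_old + 1)
        n_counts[i][j] = n_old + 1
```

Only the M entries belonging to the played matching change at each step, so the per-step update cost is O(M) and the storage is O(MN), independent of the number of arms P(N, M).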

Our proposed policy, which we refer to as Matching Learning 
for Markovian Rewards, is shown in Algorithm 1. 

Algorithm 1 Matching Learning for Markovian Rewards 
(MLMR) 

// Initialization 
for p = 1 to M do 
    for q = 1 to N do 
        n = (p - 1)N + q; 
        Play any permutation k such that $(p, q) \in A_k$; 
        Update $(\hat\theta_{i,j})_{M \times N}$, $(n_{i,j})_{M \times N}$ accordingly. 
    end for 
end for 
// Main loop 
while 1 do 
    n = n + 1; 
    Solve the Maximum Weight Matching problem (e.g., 
    using the Hungarian algorithm [15]) on the bipartite 
    graph of users and resources with edge weights 

    $$\Big(\hat\theta_{i,j} + \sqrt{\frac{L \ln n}{n_{i,j}}}\Big)_{M \times N}$$

    to play arm k that maximizes 

    $$\sum_{(i,j) \in A_k} \Big(\hat\theta_{i,j} + \sqrt{\frac{L \ln n}{n_{i,j}}}\Big) \tag{4}$$

    where L is a positive constant. 
    Update $(\hat\theta_{i,j})_{M \times N}$, $(n_{i,j})_{M \times N}$ accordingly. 
end while 
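The selection step of the main loop can be sketched as below. For brevity this sketch finds the maximizer of (4) by brute-force enumeration of all matchings, which is only a stand-in for the polynomial-time Hungarian algorithm named in Algorithm 1 and is practical only for small M and N. The function name and data layout are assumptions for illustration, indices are 0-based, and every count is assumed to be positive (as guaranteed after the initialization phase).

```python
import math
from itertools import permutations


def mlmr_select(theta_hat, n_counts, n, L):
    """Return the matching maximizing the index sum (4) at time n.

    The result is a tuple whose i-th entry is the resource (0-indexed)
    assigned to user i. Brute force over all P(N, M) matchings is used
    here instead of a polynomial-time maximum weight matching solver.
    """
    M, N = len(theta_hat), len(theta_hat[0])
    # Per-pair index: sample mean plus exploration bonus sqrt(L ln n / n_ij).
    weights = [
        [theta_hat[i][j] + math.sqrt(L * math.log(n) / n_counts[i][j])
         for j in range(N)]
        for i in range(M)
    ]
    return max(
        permutations(range(N), M),
        key=lambda m: sum(weights[i][m[i]] for i in range(M)),
    )
```

Because the index is computed per user-resource pair, a pair that has been observed only rarely receives a large bonus and pulls any matching containing it toward being selected, which is how exploration is spread across arms that share components.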



V. Analysis of Regret 

We summarize some notation we use in the description and 
analysis of our MLMR policy in Table I. 

The regret of a policy for a multi-armed bandit problem is 
traditionally upper-bounded by analyzing the expected number 
of times that each non-optimal arm is played and then 
summing, over all non-optimal arms, this expectation multiplied 
by the reward difference between an optimal arm and that 
non-optimal arm. Although we could use this approach to 
analyze the MLMR policy, we notice that the upper bound on 
regret consequently obtained is quite loose: it is linear 
in the number of arms, P(N, M). Instead, we present here a 
novel, tighter analysis of the MLMR policy. Our 
analysis shows an upper bound on the regret that is polynomial 
in M and N, and uniformly logarithmic over time. 

The following lemmas are needed for our main results in 
Theorem 1. 

Lemma 1: (Lemma 2.1 from [9]) $\{X_n, n = 1, 2, \ldots\}$ is 
an irreducible aperiodic Markov chain with state space S, 
transition matrix P, a stationary distribution $\pi_z$, $\forall z \in S$, and 
an initial distribution q. Denote $F_t$ as the $\sigma$-algebra generated 
by $X_1, X_2, \ldots, X_t$. Let G be a $\sigma$-algebra independent of 
$F = \vee_{t \ge 1} F_t$. Let $\tau$ be a stopping time with respect to the 
increasing family of $\sigma$-algebras $G \vee F_t$, $t \ge 1$. Define $N(z, \tau)$ 
such that $N(z, \tau) = \sum_{t=1}^{\tau} I(X_t = z)$. Then, 

$$|E[N(z, \tau)] - \pi_z E[\tau]| \le A_P, \tag{5}$$

for all q and all $\tau$ such that $E[\tau] < \infty$. $A_P$ is a constant that 
depends on P. 

TABLE I 
Notation 

$N$ : number of resources. 
$M$ : number of users, $M \le N$. 
$k$ : index of a parameter used for an arm, $1 \le k \le P(N, M)$. 
$i, j$ : index of a parameter used for user i, resource j. 
$A_k$ : $\{(i,j) : (i,j) \text{ is in permutation } k\}$. 
$K_{i,j}$ : $\{A_k : (i,j) \in A_k\}$. 
$*$ : index indicating that a parameter is for the optimal arm. If there are multiple optimal arms, * refers to any of them. 
$n_{i,j}$ : number of times that resource j has been matched with user i up to the current time slot. 
$\hat\theta_{i,j}$ : average (sample mean) of all observed values of resource j by user i up to the current time slot. 
$n_i^k$ : $n_{i,j}$ such that $(i,j) \in A_k$ at the current time slot. 
$S_{i,j}$ : state space of the Markov chain for user-resource pair (i, j). 
$P_{i,j}$ : transition matrix of the Markov chain associated with user-resource pair (i, j). 
$\pi_z^{i,j}$ : steady state distribution for state z of the Markov chain associated with (i, j). 
$\theta_z^{i,j}$ : reward obtained by user i while accessing resource j in state $z \in S_{i,j}$. 
$\mu_{i,j}$ : $\sum_{z \in S_{i,j}} \theta_z^{i,j} \pi_z^{i,j}$, the mean reward for user i using resource j. 
$\mu_k$ : $\sum_{(i,j) \in A_k} \mu_{i,j}$. 
$\mu^*$ : $\max_k \sum_{(i,j) \in A_k} \mu_{i,j}$. 
$\Delta_k$ : $\mu^* - \mu_k$. 
$\Delta_{\min}$ : $\min_{k: \mu_k < \mu^*} \Delta_k$. 
$\Delta_{\max}$ : $\max_k \Delta_k$. 
$\pi_{\min}$ : $\min_{1 \le i \le M, 1 \le j \le N, z \in S_{i,j}} \pi_z^{i,j}$. 
$s_{\max}$ : $\max_{1 \le i \le M, 1 \le j \le N} |S_{i,j}|$. 
$s_{\min}$ : $\min_{1 \le i \le M, 1 \le j \le N} |S_{i,j}|$. 
$\theta_{\max}$ : $\max_{1 \le i \le M, 1 \le j \le N, z \in S_{i,j}} \theta_z^{i,j}$. 
$\theta_{\min}$ : $\min_{1 \le i \le M, 1 \le j \le N, z \in S_{i,j}} \theta_z^{i,j}$. 
$\epsilon_{i,j}$ : eigenvalue gap, defined as $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of $P_{i,j}$. 
$\epsilon_{\max}$ : $\max_{1 \le i \le M, 1 \le j \le N} \epsilon_{i,j}$. 
$\epsilon_{\min}$ : $\min_{1 \le i \le M, 1 \le j \le N} \epsilon_{i,j}$. 
$T_k(n)$ : number of times arm k has been played by MLMR in the first n time slots. 
$\hat\theta_k(n)$ : $\sum_{(i,j) \in A_k} \hat\theta_{i,j}(n)$, the summation of all the average observation values in arm k at time n. 
$\hat\theta_{i,n_i^k}$ : $\hat\theta_{i,j}(n)$ such that $(i,j) \in A_k$ and $n_{i,j}(n) = n_i^k$. 
$\hat\theta_{k,n_k}$ : $\sum_{i=1}^{M} \hat\theta_{i,n_i^k}$. 

Lemma 2: (Corollary 1 from [10]) Let $\pi_{\min}$ be the minimum 
value among the stationary distribution, which is defined 
as $\pi_{\min} = \min_{z \in S} \pi_z$. Then $A_P \le 1/\pi_{\min}$. 

Lemma 3: For user-resource matching, if the state of the 
reward associated with each user-resource pair (i, j) is given 
by a Markov chain, denoted $\{X_1^{i,j}, X_2^{i,j}, \ldots\}$, satisfying the 
properties of Lemma 1, then the regret under policy $\alpha$ is 
bounded by: 

$$R^{\alpha}(n) \le \sum_{k=1}^{P(N,M)} (\mu^* - \mu_k) E^{\alpha}[T_k^{\alpha}(n)] + A_{S,P,\Theta}, \tag{6}$$

where $A_{S,P,\Theta}$ is a constant that depends on all the 
state spaces $\{S_{i,j}\}_{1 \le i \le M, 1 \le j \le N}$, transition probability 
matrices $\{P_{i,j}\}_{1 \le i \le M, 1 \le j \le N}$ and the rewards set 
$\{\theta_z^{i,j}, z \in S_{i,j}\}_{1 \le i \le M, 1 \le j \le N}$. 

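Proof: 

$\forall 1 \le i \le M, 1 \le j \le N$, define $G_{i,j} = \vee_{(k,l) \ne (i,j)} F_{k,l}$, 
where $F_{i,j} = \vee_{t \ge 1} F_t^{i,j}$, which applies to the Markov chain 
$\{X_1^{i,j}, X_2^{i,j}, \ldots\}$. We note that the Markov chains of different 
user-resource pairs are mutually independent, so $\forall i, j$, $G_{i,j}$ is 
independent of $F_{i,j}$. $F_{i,j}$ satisfies the conditions in Lemma 1. 
Let $T_{i,j}^{\alpha}(n)$ be the number of times that resource j has been 
assigned to user i up to time n under policy $\alpha$. Note that 
$T_{i,j}^{\alpha}(n)$ is a stopping time with respect to $\{G_{i,j} \vee F_n^{i,j}, n \ge 1\}$. 

Since the state of a Markov chain evolves only when it is 
observed, $X_1^{i,j}, \ldots, X_{T_{i,j}^{\alpha}(n)}^{i,j}$ represents the successive states 
of the Markov chain up to n when assigning resource j to 
user i. Then the total reward obtained under policy $\alpha$ up to 
time n is given by: 

$$\sum_{t=1}^{n} Y_{\alpha(t)}(t) = \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{l=1}^{T_{i,j}^{\alpha}(n)} \sum_{z \in S_{i,j}} \theta_z^{i,j} I(X_l^{i,j} = z). \tag{7}$$

Note that $\forall i = 1, \ldots, M$, $T_k^{\alpha}(n) = T_k^{\alpha,i}(n)$, where $T_k^{\alpha,i}(n)$ is 
the number of times up to n that the i-th component has been 
observed while playing arm k, and there exists one resource 
index j such that $(i,j) \in A_k$. Writing $\mu_i^k$ for $\mu_{i,j}$ with 
$(i,j) \in A_k$, we have: 

$$\begin{aligned}
\sum_{k=1}^{P(N,M)} \mu_k E^{\alpha}[T_k^{\alpha}(n)]
&= \sum_{k=1}^{P(N,M)} \sum_{i=1}^{M} \mu_i^k E^{\alpha}[T_k^{\alpha}(n)] \\
&= \sum_{k=1}^{P(N,M)} \sum_{i=1}^{M} \mu_i^k E^{\alpha}[T_k^{\alpha,i}(n)] \\
&= \sum_{j=1}^{N} \sum_{i=1}^{M} \mu_{i,j} \sum_{A_k \in K_{i,j}} E^{\alpha}[T_k^{\alpha,i}(n)] \\
&= \sum_{j=1}^{N} \sum_{i=1}^{M} \mu_{i,j} E^{\alpha}[T_{i,j}^{\alpha}(n)] \\
&= \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{z \in S_{i,j}} \theta_z^{i,j} \pi_z^{i,j} E^{\alpha}[T_{i,j}^{\alpha}(n)].
\end{aligned}$$

Hence, using $\sum_{k} T_k^{\alpha}(n) = n$, 

$$\begin{aligned}
\Big| R^{\alpha}(n) - \sum_{k=1}^{P(N,M)} (\mu^* - \mu_k) E^{\alpha}[T_k^{\alpha}(n)] \Big|
&= \Big| \Big(n\mu^* - E^{\alpha}\Big[\sum_{t=1}^{n} Y_{\alpha(t)}(t)\Big]\Big) - \Big(n\mu^* - \sum_{k=1}^{P(N,M)} \mu_k E^{\alpha}[T_k^{\alpha}(n)]\Big) \Big| \\
&= \Big| E^{\alpha}\Big[\sum_{t=1}^{n} Y_{\alpha(t)}(t)\Big] - \sum_{k=1}^{P(N,M)} \mu_k E^{\alpha}[T_k^{\alpha}(n)] \Big| \\
&= \Big| \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{z \in S_{i,j}} \theta_z^{i,j} E^{\alpha}\Big[\sum_{l=1}^{T_{i,j}^{\alpha}(n)} I(X_l^{i,j} = z)\Big] - \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{z \in S_{i,j}} \theta_z^{i,j} \pi_z^{i,j} E^{\alpha}[T_{i,j}^{\alpha}(n)] \Big| \\
&\le \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{z \in S_{i,j}} \theta_z^{i,j} \Big| E^{\alpha}[N(z, T_{i,j}^{\alpha}(n))] - \pi_z^{i,j} E^{\alpha}[T_{i,j}^{\alpha}(n)] \Big|.
\end{aligned}$$

Based on Lemma 1, we have: 

$$\Big| R^{\alpha}(n) - \sum_{k=1}^{P(N,M)} (\mu^* - \mu_k) E^{\alpha}[T_k^{\alpha}(n)] \Big| \le \sum_{j=1}^{N} \sum_{i=1}^{M} \sum_{z \in S_{i,j}} \theta_z^{i,j} A_{P_{i,j}} = A_{S,P,\Theta}. \tag{8}$$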

Lemma 4: (Theorem 2.1 from [16]) Let $\{X_n, n = 1, 2, \ldots\}$ 
be an irreducible aperiodic Markov chain with finite 
state space S, transition matrix P, a stationary distribution $\pi_z$, 
$\forall z \in S$, and an initial distribution q. Let $N_q = \|(q_z / \pi_z), z \in S\|_2$. 
The eigenvalue gap $\epsilon$ is defined as $\epsilon = 1 - \lambda_2$, where 
$\lambda_2$ is the second largest eigenvalue of the matrix P. $\forall A \subset S$, 
define $t_A(n)$ as the total number of times that all states in the 
set A are visited up to time n. Then $\forall \gamma \ge 0$, 

$$P(t_A(n) - n\pi_A \ge \gamma) \le \Big(1 + \frac{\gamma \epsilon}{10 n}\Big) N_q\, e^{-\gamma^2 \epsilon / 20 n}, \tag{9}$$

where $\pi_A = \sum_{z \in A} \pi_z$. 

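Our main results on the regret of the MLMR policy are shown 
in Theorem 1. We show that by using a constant L which 
is bigger than a value determined by the minimum eigenvalue 
gap of the transition matrix, the maximum value of the number of 
states, and the maximum value of the rewards, our MLMR policy 
is guaranteed to achieve a regret that is uniformly logarithmic 
in time, and polynomial in the number of users and resources. 

Theorem 1: When using any constant 
$L \ge (50 + 40M) s_{\max}^2 \theta_{\max}^2 / \epsilon_{\min}$, the expected regret under the MLMR 
policy specified in Algorithm 1 is at most 

$$\Big[ \frac{4 M^3 N L \ln n}{(\Delta_{\min})^2} + MN + M^2 N \frac{s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) \frac{\pi^2}{3} \Big] \Delta_{\max} + A_{S,P,\Theta}, \tag{10}$$

where $\Delta_{\min}$, $\Delta_{\max}$, $\pi_{\min}$, $s_{\max}$, $s_{\min}$, $\theta_{\max}$, $\theta_{\min}$, $\epsilon_{\max}$ and $\epsilon_{\min}$ 
follow the definitions in Table I, and $A_{S,P,\Theta}$ follows the definition 
in Lemma 3. 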
Proof: 

Denote $C_{t,n}$ as $\sqrt{\frac{L \ln t}{n}}$. Denote 
$C_{t,n_k} = \sum_{i=1}^{M} \sqrt{\frac{L \ln t}{n_i^k}} = \sum_{i=1}^{M} C_{t,n_i^k}$. It is also 
denoted as $C_{t,(n_1^k, \ldots, n_M^k)}$ sometimes for a clear explanation in 
this proof. 

We introduce $\hat T_{i,j}(n)$ as a counter after the initialization 
period. It is updated in the following way: 

At each time slot after the initialization period, one of the 
two cases must happen: (1) an optimal arm is played; (2) a 
non-optimal arm is played. In the first case, $(\hat T_{i,j}(n))_{M \times N}$ 
won't be updated. When a non-optimal arm k(n) is picked 
at time n, there must be at least one $(i,j) \in A_k$ such that 
$n_{i,j}(n) = \min_{(i_1,j_1) \in A_k} n_{i_1,j_1}(n)$. If there is only one such pair, 
$\hat T_{i,j}(n)$ is increased by 1. If there are multiple such pairs, 
we arbitrarily pick one, say (i', j'), and increment $\hat T_{i',j'}$ by 1. 

Each time a non-optimal arm is picked, exactly one 
element in $(\hat T_{i,j}(n))_{M \times N}$ is incremented by 1. This implies 
that the total number of times that we have played the non-optimal arms 
is equal to the summation of all counters in $(\hat T_{i,j}(n))_{M \times N}$. 
Therefore, we have: 

$$\sum_{k: \mu_k < \mu^*} E[T_k(n)] = \sum_{i=1}^{M} \sum_{j=1}^{N} E[\hat T_{i,j}(n)]. \tag{11}$$

Also note for $\hat T_{i,j}(n)$, the following inequality holds: 

$$\hat T_{i,j}(n) \le n_{i,j}(n), \quad \forall 1 \le i \le M, 1 \le j \le N. \tag{12}$$

Denote by $\hat I_{i,j}(n)$ the indicator function which is equal to 
1 if $\hat T_{i,j}(n)$ is incremented by one at time n. Let l be an arbitrary 
positive integer. Then: 

$$\hat T_{i,j}(n) = \sum_{t=MN+1}^{n} 1\{\hat I_{i,j}(t) = 1\} \le l + \sum_{t=MN+1}^{n} 1\{\hat I_{i,j}(t) = 1, \hat T_{i,j}(t-1) \ge l\},$$

where 1(x) is the indicator function defined to be 1 when the 
predicate x is true, and 0 when it is false. 

When $\hat I_{i,j}(t) = 1$, a non-optimal arm is picked for which 
$n_{i,j}$ is the minimum in this 
arm. We denote this arm as k(t) since at each time that 
$\hat I_{i,j}(t) = 1$, we may get different arms. Then, 

$$\begin{aligned}
\hat T_{i,j}(n) &\le l + \sum_{t=MN+1}^{n} 1\{\hat\theta^*(t-1) + C_{t-1,n^*(t-1)} \le \hat\theta_{k(t-1)}(t-1) + C_{t-1,n_{k(t-1)}(t-1)},\ \hat T_{i,j}(t-1) \ge l\} \\
&\le l + \sum_{t=MN}^{n} 1\{\hat\theta^*(t) + C_{t,n^*(t)} \le \hat\theta_{k(t)}(t) + C_{t,n_{k(t)}(t)},\ \hat T_{i,j}(t) \ge l\}.
\end{aligned}$$

Based on (12), $l \le \hat T_{i,j}(t)$ implies 

$$l \le \hat T_{i,j}(t) \le n_{i,j}(t) = n_i^{k(t)}.$$

So, 

$$\forall 1 \le i \le M,\ n_i^{k(t)} \ge l.$$

Then we could bound $\hat T_{i,j}(n)$ as, 

$$\begin{aligned}
\hat T_{i,j}(n) &\le l + \sum_{t=MN}^{n} 1\Big\{ \min_{0 < n_1^*, \ldots, n_M^* \le t} \big(\hat\theta^*_{n_1^*, \ldots, n_M^*} + C_{t,(n_1^*, \ldots, n_M^*)}\big) \le \max_{l \le n_1^{k(t)}, \ldots, n_M^{k(t)} \le t} \big(\hat\theta_{k(t),n_1^{k(t)}, \ldots, n_M^{k(t)}} + C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}\big) \Big\} \\
&\le l + \sum_{t=1}^{\infty} \sum_{n_1^*=1}^{t} \cdots \sum_{n_M^*=1}^{t} \sum_{n_1^{k(t)}=l}^{t} \cdots \sum_{n_M^{k(t)}=l}^{t} 1\Big\{ \hat\theta^*_{n_1^*, \ldots, n_M^*} + C_{t,(n_1^*, \ldots, n_M^*)} \le \hat\theta_{k(t),n_1^{k(t)}, \ldots, n_M^{k(t)}} + C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})} \Big\}.
\end{aligned}$$

$\hat\theta^*_{n_1^*, \ldots, n_M^*} + C_{t,(n_1^*, \ldots, n_M^*)} \le \hat\theta_{k(t),n_1^{k(t)}, \ldots, n_M^{k(t)}} + C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}$ 
means that at least one of the following 
must be true: 

$$\hat\theta^*_{n_1^*, \ldots, n_M^*} \le \mu^* - C_{t,(n_1^*, \ldots, n_M^*)}, \tag{13}$$

$$\hat\theta_{k(t),n_1^{k(t)}, \ldots, n_M^{k(t)}} \ge \mu_{k(t)} + C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}, \tag{14}$$

$$\mu^* < \mu_{k(t)} + 2 C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}. \tag{15}$$

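Here we first find the upper bound for 
$Pr\{\hat\theta^*_{n_1^*, \ldots, n_M^*} \le \mu^* - C_{t,(n_1^*, \ldots, n_M^*)}\}$: 

$$\begin{aligned}
Pr\{\hat\theta^*_{n_1^*, \ldots, n_M^*} \le \mu^* - C_{t,(n_1^*, \ldots, n_M^*)}\}
&= Pr\Big\{ \sum_{i=1}^{M} \hat\theta^*_{i,n_i^*} \le \sum_{i=1}^{M} \mu_i^* - \sum_{i=1}^{M} C_{t,n_i^*} \Big\} \\
&\le Pr\{\text{At least one of the following must hold: } \hat\theta^*_{i,n_i^*} \le \mu_i^* - C_{t,n_i^*},\ 1 \le i \le M\} \\
&\le \sum_{i=1}^{M} Pr\{\hat\theta^*_{i,n_i^*} \le \mu_i^* - C_{t,n_i^*}\}.
\end{aligned} \tag{16}$$

$\forall 1 \le i \le M$, let $n_i^*(z)$ denote the number of times state z 
has been observed among the $n_i^*$ observations, and write $\theta_i^*(z)$, 
$\pi_i^*(z)$, $S_i^*$ and $\epsilon_i^*$ for the reward, stationary probability, state 
space and eigenvalue gap of the corresponding chain. Then 

$$\begin{aligned}
Pr\{\hat\theta^*_{i,n_i^*} \le \mu_i^* - C_{t,n_i^*}\}
&= Pr\Big\{ \sum_{z=1}^{|S_i^*|} \frac{\theta_i^*(z)\, n_i^*(z)}{n_i^*} \le \sum_{z=1}^{|S_i^*|} \theta_i^*(z) \pi_i^*(z) - C_{t,n_i^*} \Big\} \\
&\le Pr\Big\{\text{At least one of the following must hold: } \theta_i^*(z) n_i^*(z) - n_i^* \theta_i^*(z) \pi_i^*(z) \le -\frac{n_i^*}{|S_i^*|} C_{t,n_i^*},\ 1 \le z \le |S_i^*| \Big\} \\
&\le \sum_{z=1}^{|S_i^*|} Pr\Big\{ n_i^*(z) - n_i^* \pi_i^*(z) \le -\frac{n_i^*}{|S_i^*| \theta_i^*(z)} C_{t,n_i^*} \Big\} \\
&= \sum_{z=1}^{|S_i^*|} Pr\Big\{ \Big(n_i^* - \sum_{l \ne z} n_i^*(l)\Big) - n_i^* \Big(1 - \sum_{l \ne z} \pi_i^*(l)\Big) \le -\frac{n_i^*}{|S_i^*| \theta_i^*(z)} C_{t,n_i^*} \Big\} \\
&= \sum_{z=1}^{|S_i^*|} Pr\Big\{ \sum_{l \ne z} n_i^*(l) - n_i^* \sum_{l \ne z} \pi_i^*(l) \ge \frac{n_i^*}{|S_i^*| \theta_i^*(z)} C_{t,n_i^*} \Big\}.
\end{aligned}$$

$\forall 1 \le z \le |S_i^*|$, applying Lemma 4 we could find the upper 
bound of each probability in (16) as, 

$$\begin{aligned}
Pr\{\hat\theta^*_{i,n_i^*} \le \mu_i^* - C_{t,n_i^*}\}
&\le \sum_{z=1}^{|S_i^*|} \Big(1 + \frac{\epsilon_i^*}{10 |S_i^*| \theta_i^*(z)} \sqrt{\frac{L \ln t}{n_i^*}}\Big) N_q\, t^{-\frac{L \epsilon_i^*}{20 |S_i^*|^2 \theta_i^*(z)^2}} \\
&\le \frac{s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) t^{-\frac{L \epsilon_{\min} - 10 s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}},
\end{aligned} \tag{17}$$

where (17) holds since for any $q^{i,j}$, 

$$N_q^{i,j} = \Big\| \Big(\frac{q_z^{i,j}}{\pi_z^{i,j}}\Big), z \in S_{i,j} \Big\|_2 \le \sum_{z=1}^{|S_{i,j}|} \Big\| \frac{q_z^{i,j}}{\pi_z^{i,j}} \Big\|_2 \le \sum_{z=1}^{|S_{i,j}|} \frac{q_z^{i,j}}{\pi_{\min}} = \frac{1}{\pi_{\min}}.$$

Thus, 

$$Pr\{\hat\theta^*_{n_1^*, \ldots, n_M^*} \le \mu^* - C_{t,(n_1^*, \ldots, n_M^*)}\} \le \frac{M s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) t^{-\frac{L \epsilon_{\min} - 10 s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}}. \tag{18}$$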

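With a similar calculation, we can also get the upper 
bound of the probability for (14): 

$$\begin{aligned}
Pr\{\hat\theta_{k(t),n_1^{k(t)}, \ldots, n_M^{k(t)}} \ge \mu_{k(t)} + C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}\}
&\le \sum_{i=1}^{M} Pr\Big\{ \sum_{z=1}^{|S_i^{k(t)}|} \frac{\theta_i^{k(t)}(z)\, n_i^{k(t)}(z)}{n_i^{k(t)}} \ge \sum_{z=1}^{|S_i^{k(t)}|} \theta_i^{k(t)}(z) \pi_i^{k(t)}(z) + C_{t,n_i^{k(t)}} \Big\} \\
&\le \sum_{i=1}^{M} \sum_{z=1}^{|S_i^{k(t)}|} Pr\Big\{ \theta_i^{k(t)}(z) n_i^{k(t)}(z) - n_i^{k(t)} \theta_i^{k(t)}(z) \pi_i^{k(t)}(z) \ge \frac{n_i^{k(t)}}{|S_i^{k(t)}|} C_{t,n_i^{k(t)}} \Big\} \\
&= \sum_{i=1}^{M} \sum_{z=1}^{|S_i^{k(t)}|} Pr\Big\{ n_i^{k(t)}(z) - n_i^{k(t)} \pi_i^{k(t)}(z) \ge \frac{n_i^{k(t)}}{|S_i^{k(t)}| \theta_i^{k(t)}(z)} C_{t,n_i^{k(t)}} \Big\} \\
&\le \sum_{i=1}^{M} \frac{s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) t^{-\frac{L \epsilon_{\min} - 10 s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}} \\
&= \frac{M s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) t^{-\frac{L \epsilon_{\min} - 10 s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}}.
\end{aligned} \tag{19}$$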

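Note that for $l \ge \Big\lceil \frac{4 L \ln n}{(\Delta_{k(t)} / M)^2} \Big\rceil$, 

$$\begin{aligned}
\mu^* - \mu_{k(t)} - 2 C_{t,(n_1^{k(t)}, \ldots, n_M^{k(t)})}
&= \mu^* - \mu_{k(t)} - 2 \sum_{i=1}^{M} \sqrt{\frac{L \ln t}{n_i^{k(t)}}} \\
&\ge \mu^* - \mu_{k(t)} - 2M \sqrt{\frac{L \ln n}{l}} \\
&\ge \mu^* - \mu_{k(t)} - 2M \sqrt{\frac{L \ln n}{4 L \ln n} \Big(\frac{\Delta_{k(t)}}{M}\Big)^2} \\
&= \mu^* - \mu_{k(t)} - \Delta_{k(t)} = 0.
\end{aligned} \tag{20}$$

(20) implies that condition (15) is false when 
$l = \Big\lceil \frac{4 L \ln n}{(\Delta_{k(t)} / M)^2} \Big\rceil$. If we let 
$l = \Big\lceil \frac{4 L \ln n}{(\Delta_{\min}^{i,j} / M)^2} \Big\rceil$, then (15) is false 
for all k(t), $1 \le t \le \infty$, where 

$$\Delta_{\min}^{i,j} = \min_k \{\Delta_k : (i,j) \in A_k\}. \tag{21}$$

Therefore, 

$$\begin{aligned}
E[\hat T_{i,j}(n)]
&\le \Big\lceil \frac{4 L \ln n}{(\Delta_{\min}^{i,j} / M)^2} \Big\rceil + \sum_{t=1}^{\infty} \Big( \sum_{n_1^*=1}^{t} \cdots \sum_{n_M^*=1}^{t} \sum_{n_1^{k}=l}^{t} \cdots \sum_{n_M^{k}=l}^{t} \frac{2 M s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) t^{-\frac{L \epsilon_{\min} - 10 s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}} \Big) \\
&\le \frac{4 M^2 L \ln n}{(\Delta_{\min}^{i,j})^2} + 1 + \frac{M s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) \sum_{t=1}^{\infty} 2\, t^{-\frac{L \epsilon_{\min} - (40M + 10) s_{\max}^2 \theta_{\max}^2}{20 s_{\max}^2 \theta_{\max}^2}} \\
&\le \frac{4 M^2 L \ln n}{(\Delta_{\min}^{i,j})^2} + 1 + \frac{M s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) \frac{\pi^2}{3},
\end{aligned} \tag{22}$$

where (22) holds since $L \ge (50 + 40M) s_{\max}^2 \theta_{\max}^2 / \epsilon_{\min}$. 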
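So under our MLMR policy, 

$$\begin{aligned}
R^{\alpha}(n) &\le \sum_{k=1}^{P(N,M)} (\mu^* - \mu_k) E[T_k(n)] + A_{S,P,\Theta} \\
&= \sum_{k: \mu_k < \mu^*} \Delta_k E[T_k(n)] + A_{S,P,\Theta} \\
&\le \Delta_{\max} \sum_{k: \mu_k < \mu^*} E[T_k(n)] + A_{S,P,\Theta} \\
&= \Delta_{\max} \sum_{i=1}^{M} \sum_{j=1}^{N} E[\hat T_{i,j}(n)] + A_{S,P,\Theta}
\end{aligned} \tag{23}$$

$$\begin{aligned}
&\le \Big[ \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{4 M^2 L \ln n}{(\Delta_{\min}^{i,j})^2} + MN + M^2 N \frac{s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) \frac{\pi^2}{3} \Big] \Delta_{\max} + A_{S,P,\Theta} \\
&\le \Big[ \frac{4 M^3 N L \ln n}{(\Delta_{\min})^2} + MN + M^2 N \frac{s_{\max}}{\pi_{\min}} \Big(1 + \frac{\epsilon_{\max} \sqrt{L}}{10\, s_{\min} \theta_{\min}}\Big) \frac{\pi^2}{3} \Big] \Delta_{\max} + A_{S,P,\Theta}.
\end{aligned} \tag{24}$$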

Theorem 1 shows that when we use a constant L which is 
large enough such that $L \ge (50 + 40M) s_{\max}^2 \theta_{\max}^2 / \epsilon_{\min}$, the regret of 
Algorithm 1 is upper-bounded uniformly over time n by a 
function that grows as $O(M^3 N \ln n)$. However, when $\theta_{\max}$, 
$s_{\max}$ or $\epsilon_{\min}$ is unknown, the upper bound of regret cannot 
be guaranteed to grow logarithmically in n. 

So when no knowledge about the system is available, 
we extend the MLMR policy to achieve a regret that is 
bounded uniformly over time n by a function that grows as 
$O(M^3 N L(n) \ln n)$, by using any arbitrarily slowly diverging 
non-decreasing sequence L(n) such that $L(n) \le n$ for any 
n in Algorithm 1. Since L(n) could grow arbitrarily slowly, 
MLMR could achieve a regret arbitrarily close to the 
logarithmic order. We present our analysis in Theorem 2. 

Theorem 2: When using any arbitrarily slowly diverging 
non-decreasing sequence L(n) (i.e., $L(n) \to \infty$ as $n \to \infty$) 
in (4) such that $\forall n$, $L(n) \le n$, the expected regret under the 
MLMR policy specified in Algorithm 1 is at most 



AM 3 NL{n)\nn 
2 h Ml\tSs,p,e 



(A min )^ 



+M 2 JV= 



lO^min^min 



^4s,p.e, 



(25) 

where $B_{S,P,\Theta}$ is a constant that depends on $\theta_{\max}$, $s_{\max}$ and 
$\epsilon_{\min}$. 

Proof: 

Denote $C_{t,n}$ as $\sqrt{\frac{L(t) \ln t}{n}}$. Denote $C_{t,n_k}$ as 
$\sum_{i=1}^{M} \sqrt{\frac{L(t) \ln t}{n_i^k}}$. Then replacing L with L(t) in the 
proof of Theorem 1, (11) to (21) still stand. The upper bound 
of $E[\hat T_{i,j}(n)]$ in (22) should be modified as in (26). 

is a diverging non-decreasing sequence, 
so there exists a constant tu such that for all 

t > h, L(t) > (60+40 f )fl — s — , which implies 



£ 20s max e m; 

Thus, we have 



< r 



9 



E[T M (n)] < 



AL{n) Inn 

w 



t t t t 

El E- E E- E ^ 

t=l \n* = l nJ=M n fc=l n*=Af 



-L(t)e min -10s^ lax 9^ la , 



10s mi 



< 4:M 2 L(n) Inn J M £max 



< 



V mm / 

4M 2 i(n) Inn 

V mm J 



lOs mm mm 



+ 1 + M- 



10s 



min^min 



oo 

E 2 * 



min^mm 



L(t)e min -(40M + 10)s 
20sLv€ a v 



i(t)e min -(40M+10)s 
: ^ W2 



20s£ 1 „„(9 



(26) 



E[Ti,j(n)] < 



4M 2 L(n)lnn 




where -Bs,p,e is a constant as shown in 

On $max5 ^max and C m i n . 

Then for the MLMR policy with L(n), 

M N 

max 

*=1 J=l 



s,p,e 
(27) 

which depends 



< 



4M 3 iVL(n)lni 



(A min ) 

S max 



MNB s ,p^ 



1 



10s 



mm^mm 



A r 



A 



s.p.e- 



(29) 



VI. Examples and Simulation Results 

We consider a system that consists of M = 2 users and 
N = 4 resources. The state of each resource evolves as an 
irreducible, aperiodic Markov chain with two states "0" and 
"1". For all the tables in this section, the element in the i-th 
row and j-th column represents the value for the user-resource 
pair $(i,j)$. The transition probabilities are shown in the tables
below:

The rewards on each state are:
For $1 \le i \le M$, $1 \le j \le N$, the stationary distribution
of user-resource pair $(i,j)$ on state "0" is calculated as
$p^{i,j}_{10}/(p^{i,j}_{01}+p^{i,j}_{10})$; the stationary distribution on state "1" is calculated
as $p^{i,j}_{01}/(p^{i,j}_{01}+p^{i,j}_{10})$. The eigenvalue gap is $\epsilon_{i,j} = p^{i,j}_{01}+p^{i,j}_{10}$. The
expected reward $\mu_{i,j}$ for all the pairs can be calculated as:

0.6909   0.3909   0.4333   0.425
0.3363   0.4429   0.6615   0.4909
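The closed-form expressions for a two-state chain can be sketched in a few lines. This is an illustration only: the function name and the numeric parameters below are hypothetical (they are not the values used in the example), and the expected reward is taken to be the reward on each state weighted by the stationary distribution.

```python
def two_state_summary(p01, p10, theta0, theta1):
    """Stationary distribution, eigenvalue gap and expected reward of a two-state chain.

    p01, p10: transition probabilities 0 -> 1 and 1 -> 0.
    theta0, theta1: rewards on states "0" and "1".
    """
    total = p01 + p10
    pi0 = p10 / total            # stationary probability of state "0"
    pi1 = p01 / total            # stationary probability of state "1"
    gap = total                  # eigenvalue gap p01 + p10, as stated in the text
    mu = theta0 * pi0 + theta1 * pi1
    return pi0, pi1, gap, mu

# Hypothetical parameters for a single user-resource pair.
pi0, pi1, gap, mu = two_state_summary(p01=0.3, p10=0.5, theta0=0.2, theta1=0.9)
print(pi0, pi1, gap, mu)
```

One quick sanity check is that the returned distribution is invariant under the transition matrix, i.e. `pi0 * (1 - p01) + pi1 * p10` equals `pi0`.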



We can see that the arm {(1, 1), (2, 3)} is the optimal arm
with greatest expected reward $\mu^* = 0.6909 + 0.6615 = 1.3524$.
$\Delta_{\min} = 0.1706$.
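For a system this small, the optimal arm and $\Delta_{\min}$ can be checked by enumerating all matchings of users to distinct resources. The sketch below is a brute-force check, not the matching routine of the policy; the function name is hypothetical, and the matrix is the expected-reward table given above for this example.

```python
from itertools import permutations

# Expected rewards mu[i][j] for user i+1 and resource j+1 (Example 1).
mu = [[0.6909, 0.3909, 0.4333, 0.425],
      [0.3363, 0.4429, 0.6615, 0.4909]]

def rank_matchings(mu):
    """Return all arms (matchings) sorted by total expected reward, best first."""
    M, N = len(mu), len(mu[0])
    scored = []
    for resources in permutations(range(N), M):   # one distinct resource per user
        total = sum(mu[i][j] for i, j in enumerate(resources))
        pairs = tuple((i + 1, j + 1) for i, j in enumerate(resources))
        scored.append((total, pairs))
    scored.sort(reverse=True)
    return scored

scored = rank_matchings(mu)
best_value, best_arm = scored[0]
delta_min = best_value - scored[1][0]   # gap to the best non-optimal arm
print(best_arm, round(best_value, 4), round(delta_min, 4))
```

The enumeration visits $P(N, M) = 12$ arms here; the second-best arm is {(1, 1), (2, 4)}, which gives the gap quoted in the text.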




Fig. 1. Simulation Results of Example 1 with $\Delta_{\min} = 0.1706$



Figure 1 shows the simulation result of the regret (normalized
with respect to the logarithm of time) for our MLMR
policy for the above system with different choices of $L$.
We also show the theoretical upper bound for comparison.
The value of $L$ to satisfy the condition in Theorem 1 is
$L \ge (50+40M)\,\theta_{\max}^2 s_{\max}^2/\epsilon_{\min} = 303$, so we picked $L = 303$ in the
simulation.
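A simulation of this kind can be sketched as follows. This is a hedged illustration, not the authors' code: Algorithm 1 is not reproduced in this section, so the index used here (sample mean plus $\sqrt{L \ln n / m_{i,j}}$, where $m_{i,j}$ is the play count of pair $(i,j)$) is an assumption of UCB form, the brute-force search over arms stands in for a polynomial-time maximum-weight matching routine, and all chain parameters are hypothetical.

```python
import math
import random
from itertools import permutations

def simulate_mlmr(p01, p10, theta, L, horizon, seed=0):
    """UCB-style matching learner on two-state Markov chains (illustrative sketch)."""
    rng = random.Random(seed)
    M, N = len(theta), len(theta[0])
    arms = list(permutations(range(N), M))    # arm = one distinct resource per user
    state = [[0] * N for _ in range(M)]       # current state of each pair
    mean = [[0.0] * N for _ in range(M)]      # sample-mean reward of each pair
    count = [[0] * N for _ in range(M)]       # number of plays of each pair

    def play(arm):
        for i, j in enumerate(arm):
            s = state[i][j]
            reward = theta[i][j][s]
            count[i][j] += 1
            mean[i][j] += (reward - mean[i][j]) / count[i][j]
            # The chain of pair (i, j) makes a transition only when it is played.
            flip = p01[i][j] if s == 0 else p10[i][j]
            if rng.random() < flip:
                state[i][j] = 1 - s

    def index(arm, n):
        # Assumed index: sample mean plus an exploration bonus scaled by L.
        return sum(mean[i][j] + math.sqrt(L * math.log(n) / count[i][j])
                   for i, j in enumerate(arm))

    # Initialization: N shifted matchings cover every user-resource pair once.
    for shift in range(N):
        play(tuple((i + shift) % N for i in range(M)))

    for n in range(N + 1, horizon + 1):
        play(max(arms, key=lambda arm: index(arm, n)))
    return count

# Hypothetical parameters for M = 2 users and N = 4 resources.
p01 = [[0.3, 0.4, 0.5, 0.6], [0.2, 0.5, 0.7, 0.4]]
p10 = [[0.5, 0.4, 0.3, 0.2], [0.6, 0.5, 0.3, 0.4]]
theta = [[(0.1, 0.9)] * 4, [(0.2, 0.8)] * 4]
counts = simulate_mlmr(p01, p10, theta, L=2, horizon=2000)
for row in counts:
    print(row)
```

Each row of `counts` sums to the horizon, since every user is matched to exactly one resource per step. Rerunning with a larger `L` makes the exploration bonus dominate for longer, so non-optimal pairs are played more often.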

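Note that in the proof of Theorem 1, when $L < (50+40M)\,\theta_{\max}^2 s_{\max}^2/\epsilon_{\min}$,
we have $-\frac{L\epsilon_{\min}-(40M+10)s_{\max}^2\theta_{\max}^2}{20 s_{\max}^2\theta_{\max}^2} > -2$. This implies that
$\sum_{t=1}^{\infty} 2t^{-\frac{L\epsilon_{\min}-(40M+10)s_{\max}^2\theta_{\max}^2}{20 s_{\max}^2\theta_{\max}^2}}$ does not converge anymore,
and thus we could not bound $E[T_{i,j}(n)]$ any more.
Empirically, however, in Figure 1 the case when $L = 2$ also seems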



10 



g S ,P,e = 1 + A/— I 1 + ^ ) E 2 * 20s »- 9 »- 5 - (28) 



to yield logarithmic regret over time, and the performance is
in fact better than with $L = 303$, since the non-optimal arms are
played less when $L$ is smaller. However, this may be
because the cases in which $T_{i,j}(n)$ grows faster than
$\ln(n)$ happen only with very small probability when $L = 2$.
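The role of the threshold on $L$ can be checked numerically. The sketch below assumes that the exponent of $t$ in the series from the proof has the form $-\frac{L\epsilon_{\min}-(40M+10)s_{\max}^2\theta_{\max}^2}{20 s_{\max}^2\theta_{\max}^2}$ and that the threshold is $(50+40M)\,\theta_{\max}^2 s_{\max}^2/\epsilon_{\min}$; both are read from the garbled source and should be treated as assumptions. The function names and the values of `theta_max` and `eps_min` are hypothetical.

```python
def series_exponent(L, M, s_max, theta_max, eps_min):
    """Assumed exponent of t in the series used in the proof."""
    c = (s_max * theta_max) ** 2
    return -(L * eps_min - (40 * M + 10) * c) / (20 * c)

def threshold_L(M, s_max, theta_max, eps_min):
    """Assumed smallest constant L for which the exponent is at most -2."""
    return (50 + 40 * M) * (s_max * theta_max) ** 2 / eps_min

# M and s_max follow Section VI; theta_max and eps_min are hypothetical.
M, s_max, theta_max, eps_min = 2, 2, 1.0, 1.0
L_star = threshold_L(M, s_max, theta_max, eps_min)

for L in (2, L_star / 2, L_star, 2 * L_star):
    print(L, series_exponent(L, M, s_max, theta_max, eps_min))
```

At `L_star` the exponent equals $-2$, so the series is dominated by $\sum_t t^{-2}$; for much smaller $L$ the exponent becomes positive and the bounding argument no longer applies, which is the situation described for $L = 2$.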

Table II shows the number of times that resource $j$ has been
matched with user $i$ up to time $n = 10^7$.

TABLE II

L = 2        j = 1     j = 2     j = 3     j = 4
i = 1       999470     [...]       185       196
i = 2          136       293    999155       420

L = 303      j = 1     j = 2     j = 3     j = 4
i = 1       892477     30685     39410     37432
i = 2        26813     50341    850265     72585

























Fig. 2. Simulation Results of Example 2 with $\Delta_{\min} = 0.0091$ (curves: $L = 2$, $L = 303$, Theoretical Upper Bound)



Figure 2 shows the simulation results of the regret of
another example with the same transition probabilities as in
the previous example and different rewards on states as below:
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0.45 
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#0 #1 

The expected reward $\mu_{i,j}$ for all the pairs can be calculated
as:

0.5636   0.4091   0.5933   0.4875
0.6227   0.5714   0.6615   0.4954



{(1, 1), (2, 3)} is still the optimal arm. However, compared
with the previous example, we can see that the expected
rewards of three other arms {(1, 3), (2, 1)}, {(1, 3), (2, 2)},
{(1, 1), (2, 2)} are all very close to the expected reward of
the optimal arm. For this example, $\Delta_{\min} = 0.0091$, which
is much smaller than in the previous example. In this
case, the non-optimal arms are played much more than in
the previous example. This is because several arms have
expected rewards very close to $\mu^*$, so
the policy has to spend much more time exploring those
non-optimal arms to make sure that they are non-optimal.
This fact can be seen clearly in Table III, which presents the
number of times that resource $j$ has been matched with user
$i$ up to time $n = 10^7$ under both cases when $L = 2$ and
$L = 303$.



TABLE III

L = 2        j = 1     j = 2     j = 3     j = 4
i = 1       817529       544    179832      2099
i = 2       175583      3610    820097       714

L = 303      j = 1     j = 2     j = 3     j = 4
i = 1       346395     60031    472346    121232
i = 2       301491    146317    482545     69651



VII. Conclusion 

We have presented the MLMR policy for the problem of 
learning combinatorial matchings of users to resources when 
the reward process is Markovian. We showed that this policy 
requires only polynomial storage and computation per step, 
and yields a regret that grows uniformly logarithmically over 
time and only polynomially with the number of users and 
resources. 

In future work, we would like to also consider the case 
when the rewards evolve not just when a user-resource pair is 
selected, but rather at each discrete time. Further, we would 
like to investigate if it is possible to analyze regret with respect 
to the best non-static policy, which would be a stronger notion 
of regret than that considered in this paper but is much harder 
to analyze. Finally, exploring distributed schemes is also of 
interest, though likely to be highly challenging in case of 
limited information exchange between users. 